A Memory Soft Error Measurement on Production Systems

Authors

  • Xin Li
  • Kai Shen
  • Michael C. Huang
  • Lingkun Chu
Abstract

Memory state can be corrupted by the impact of particles causing single-event upsets (SEUs). Understanding and dealing with these soft (or transient) errors is important for system reliability. Several earlier studies have provided field test measurement results on memory soft error rate, but no results were available for recent production computer systems. We believe the measurement results on real production systems are uniquely valuable due to various environmental effects. This paper presents methodologies for memory soft error measurement on production systems, where the performance impact on running applications must be negligible and administrative control of the system may or may not be available. We conducted measurements in three distinct system environments: a rack-mounted server farm for a popular Internet service (Ask.com search engine), a set of office desktop computers (Univ. of Rochester), and a geographically distributed network testbed (PlanetLab). Our preliminary measurement on over 300 machines for varying multi-month periods finds 2 suspected soft errors. In particular, our result on the Internet servers indicates that, with high probability, the soft error rate is at least two orders of magnitude lower than those reported previously. We provide discussions that attribute the low error rate to several factors in today’s production system environments. In contrast, our measurement unintentionally discovers permanent (or hard) memory faults on 9 out of 212 Ask.com machines, suggesting the relative commonness of
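The "with high probability" claim about an error rate can be illustrated with a standard one-sided Poisson upper bound. The sketch below is not the authors' method; it assumes errors arrive as a Poisson process, and the function names and the exposure figure (`machine_hours`) are hypothetical, since the abstract does not state the total observation time. Only the count of 2 suspected soft errors and the 9 out of 212 figure come from the abstract.

```python
import math


def poisson_cdf(k, mu):
    """Return P(X <= k) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def rate_upper_bound(events, exposure, confidence=0.99):
    """One-sided upper confidence bound on a Poisson rate per unit exposure.

    Finds the expected count mu at which observing `events` or fewer
    has probability 1 - confidence, then divides by the exposure.
    """
    alpha = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # Grow the bracket until the CDF drops below alpha.
    while poisson_cdf(events, hi) > alpha:
        hi *= 2.0
    # Bisection: the CDF is decreasing in mu.
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(events, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi / exposure


# Hypothetical exposure; the abstract gives no machine-hour total.
machine_hours = 1_000_000.0
bound = rate_upper_bound(events=2, exposure=machine_hours, confidence=0.99)
print(f"99% upper bound: {bound:.3e} errors per machine-hour")

# Fraction of Ask.com machines with hard faults, from the abstract.
print(f"hard-fault fraction: {9 / 212:.3f}")
```

With zero observed events the bound reduces to the closed form -ln(1 - confidence) / exposure, which makes a quick sanity check for the bisection.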

Related resources

Spatial prediction of soil electrical conductivity using soil auxiliary data, soft data derived from general linear model and error measurement

Indirect measurement of soil electrical conductivity (EC) has become a major data source in spatial/temporal monitoring of soil salinity. However, in many cases, the weak correlation between direct and indirect measurement of EC has reduced the accuracy and performance of the predicted maps. The objective of this research was to estimate soil EC based on a general linear model via using se...

Architectural-Level Soft-Error Modeling for Estimating Reliability of Computer Systems

This paper proposes a soft-error model for accurately estimating the reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies which parts of the memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system at the cycle accura...

Proposing an Efficient Software-based Method to Enhance Reliability of Computer Systems against Soft Errors

In recent years, along with rapid developments in technology, computer systems have increasingly become more integrated and more modular. Indeed, the reliability and efficiency of computer systems are of high significance. Hence, the quantitative evaluation of the optimization of reliability indexes in computer systems is considered to be a crucial issue. Reliability enhancement of computer systems...

Fast Reconstruction of SAR Images with Phase Error Using Sparse Representation

In past years, a number of algorithms have been introduced for synthetic aperture radar (SAR) imaging. However, they all suffer from the same problem: the amount of data to process is considerably large. In recent years, compressive sensing and sparse representation of the signal in SAR have gained significant research interest. This method offers the advantage of reducing the sampling rate, bu...

Proposing an Efficient Software-Based Method for Enhancing the Reliability of Critical Application Robot

Robots play such remarkable roles in humans’ modern lives that performing many tasks without them is impossible. The use of robotic systems is gradually increasing, and the tasks allocated to them are becoming more complex and critical. Software reliability is one of the most significant requirements of robots. To enhance reliability, systems should be inherently designed to be tolerant of soft...

Publication date: 2007